Computing Resources:
This project was created on a Lenovo Legion 7i laptop with an Intel Core i9-14900HX CPU, 64 GB of DDR5 RAM, and an NVIDIA RTX 4070 laptop GPU with 8 GB of GDDR6 memory. The operating system initially used was Ubuntu 24.10. However, when we attempted to configure the GPU to process text data for the language model, we learned that this version of Ubuntu ships the newest kernel, which pulls in NVIDIA and CUDA driver versions that are not compatible with the TensorFlow and PyTorch builds required by the text package. We moved to Ubuntu 24.04 LTS within WSL2 on Windows 11 and were able to configure the GPU there. Because Airwars also goes back and corrects archived incidents, it is easier to rerun the full process on all available incident records when needed, and the GPU cuts back on the processing time: running the language model on the GPU reduced the run time from 2.5 hours to about 20 minutes.
We use r-base 4.4.1 from Anaconda and RStudio 2024.04.02. We have also run these same packages on RStudio Server via WSL2 but prefer to isolate the computing environment.
We store all of our data in a SQLite database that is also available in our GitHub repository.
Structure of the SQLite Database
The scraped data resulted in over 800 unique events stored in two tables in a SQLite database; a third table holds the MoH daily counts.
Table 1 contains incident metadata (e.g., unique id, incident date, web-page URL).
Table 2 stores the incident-level information, such as the number of deaths, the breakdown of deaths (children, adults), the type of attack and cause of death, the incident coordinates and results from Nominatim, and sentiment scores for seven emotional states.
Table 3 contains the Hamas Ministry of Health (MoH) daily casualties.
The first two tables relate to each other through the unique incident identification numbers provided by Airwars. We relate the MoH table to the Airwars tables by aggregating incidents up to the date level.
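The date-level linkage can be sketched as follows. This is a minimal sketch: it assumes the three tables have been read into tibbles with those names, that the incidents table carries the `Incident_id` key, and the count column names are placeholders for illustration.

```r
# Sketch of the date-level join (column names other than Incident_Date and
# Incident_id are placeholders): aggregate Airwars incidents to one row per
# day, then join the MoH daily counts on the date.
library(dplyr)

airwars_daily <- airwars_incidents |>
  left_join(airwars_meta, by = "Incident_id") |>   # attach incident dates
  count(Incident_Date, name = "n_incidents")       # incidents per day

daily_compare <- airwars_daily |>
  left_join(daily_casualties, by = "Incident_Date")  # MoH counts per day
```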
```r
# connect to the database
library(DBI)

mydb <- dbConnect(
  RSQLite::SQLite(),
  "~/repos/airwars_scraping_project/database/airwars_db.sqlite"
)

# print tables in the database
dbListTables(mydb)
#> [1] "airwars_incidents" "airwars_meta"      "daily_casualties"
```
Scraping Airwars Civilian Casualty Incidents
- The image below is an example of the Airwars incident metadata, presented as "baseball cards." This information appears on a single web page, and our workflow starts at this junction: we read the main Airwars page that houses this information and scrape only the incident date and incident ID, which we use to build the incident-specific URLs we later scrape for content.
- All of the code to conduct the scraping and processing of these data is found in our GitHub repository under code/scrape_process_incidence. This code has been optimized and takes about 30 minutes on our laptop (32 GB of RAM is sufficient) with a fast internet connection.
- Here we only explain how we pre-processed the data as it relates to preparing for analysis.
- For much of the scraping we used SelectorGadget to get the XPath and passed it through the rvest package.
- Metadata table: We scrape the main Airwars website and parse the information we need to build a table containing each incident's URL (over 800 URLs), as seen in the example below.
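The metadata scrape can be sketched as below. The listing-page URL and the CSS selectors are placeholders (the real ones, found with SelectorGadget, live in code/scrape_process_incidence), and the link-building step assumes the date has already been formatted to match Airwars' URL scheme.

```r
# Sketch of the metadata scrape; selectors and the listing URL are
# placeholders, not the production values.
library(rvest)
library(dplyr)

page <- read_html("https://airwars.org/conflict/israel-and-gaza-2023/")

airwars_meta <- tibble::tibble(
  Incident_id   = page |> html_elements(".incident-id")   |> html_text2(),
  Incident_Date = page |> html_elements(".incident-date") |> html_text2()
) |>
  # build each incident's URL from its ID and (URL-formatted) date
  mutate(link = paste0(
    "https://airwars.org/civilian-casualties/",
    tolower(Incident_id), "-", Incident_Date, "/"
  ))
```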
```r
# read in data tables
airwars_meta <- tbl(mydb, "airwars_meta") |>
  as_tibble() |>
  # convert Incident_Date to date format
  mutate(Incident_Date = as_date(Incident_Date)) |>
  arrange(Incident_Date)

airwars_meta |> head() |> kable()
```

| Incident_Date | Incident_id | link |
|---|---|---|
| 2023-10-07 | ispt0019a | https://airwars.org/civilian-casualties/ispt0019a-october-7-2023/ |
| 2023-10-07 | ispt0019 | https://airwars.org/civilian-casualties/ispt0019-october-7-2023/ |
| 2023-10-07 | ispt0017 | https://airwars.org/civilian-casualties/ispt0017-october-7-2023/ |
| 2023-10-07 | ispt0011 | https://airwars.org/civilian-casualties/ispt0011-october-7-2023/ |
| 2023-10-07 | ispt0010 | https://airwars.org/civilian-casualties/ispt0010-october-7-2023/ |
| 2023-10-07 | ispt0003 | https://airwars.org/civilian-casualties/ispt0003-october-7-2023/ |
- Using the URLs we built in the metadata table, we loop through and scrape each one (see the example below) to parse the incident assessments.
- Each incident contains an assessment section detailing what transpired during the incident, who was known to be involved, and the victims it produced. We later use this text to compute emotion scores.
- Incident table: Our final table contains the fields that Airwars populates for each incident. Besides parsing this information, we also had to process the data: some fields contain ranges of deaths (e.g., 3-5) or counts (e.g., 1 child, 3 women, 1 man), which we had to split into their own columns. This lets us estimate how many children and women have been reported as civilian casualties. Our data contain 24 variables across the 804 incidents reported by Airwars.
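The string-splitting step can be sketched with regular expressions as below. The input column names (`civilians_killed`, `breakdown`) and the exact patterns are placeholders; the production version is in code/scrape_process_incidence.

```r
# Sketch of parsing ranges ("3-5") into low/high columns and counts
# ("1 child, 3 women, 1 man") into typed columns. Column names are assumed.
library(dplyr)
library(stringr)

parse_casualties <- function(df) {
  df |>
    mutate(
      killed_low  = as.integer(str_extract(civilians_killed, "^\\d+")),
      # for a range take the number after the dash; otherwise reuse the low end
      killed_high = as.integer(coalesce(
        str_extract(civilians_killed, "(?<=-)\\d+"),
        str_extract(civilians_killed, "^\\d+")
      )),
      children = as.integer(str_extract(breakdown, "\\d+(?=\\s*child)")),
      women    = as.integer(str_extract(breakdown, "\\d+(?=\\s*wom[ae]n)")),
      men      = as.integer(str_extract(breakdown, "\\d+(?=\\s*m[ae]n)"))
    )
}
```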
```r
airwars_incidents <- tbl(mydb, "airwars_incidents") |>
  as_tibble()

airwars_incidents |>
  head() |>
  select(-assessment:-surprise) |>
  DT::datatable()
```
MoH Daily Casualties
The Palestine Datasets project publishes daily Gaza casualty counts taken from the Hamas MoH; however, they do not distinguish between civilian and militant casualties, so their numbers should be higher than what we derive from Airwars.1
- We use the Palestine API (https://data.techforpalestine.org/api/v2/casualties_daily.json) and parse the JSON, saving the result to our database after a bit of data wrangling.
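A minimal sketch of that pull is below. The response field names (`report_date`, `killed_cum`) are our reading of the v2 API and should be checked against the actual JSON; the daily-difference step is one way to derive per-day counts from a cumulative series.

```r
# Sketch: pull the daily MoH counts and wrangle into a date-keyed tibble.
# Field names are assumptions about the v2 API response.
library(jsonlite)
library(dplyr)

daily <- fromJSON(
  "https://data.techforpalestine.org/api/v2/casualties_daily.json"
) |>
  as_tibble() |>
  transmute(
    Incident_Date = lubridate::as_date(report_date),
    killed_cum    = killed_cum,                    # cumulative reported killed
    killed_daily  = killed_cum - lag(killed_cum)   # back out daily counts
  )
```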
```r
tbl(mydb, "daily_casualties") |>
  as_tibble() |>
  mutate(Incident_Date = lubridate::as_date(Incident_Date)) |>
  head() |>
  DT::datatable()
```
Enriching Data with Reverse Geocoding
When possible, Airwars includes the coordinates of where an incident took place. Although this information is contained within the assessment, Airwars standardizes its location under a "Geolocation notes" heading, from which we were able to parse the latitude and longitude for geographic plotting. Of the 804 incidents, about 65% contain geographic coordinates.
- For incidents that contained coordinates, we used the Nominatim OpenStreetMap API to reverse geocode them and bring back the type of location that was targeted. We also save a bounding-box set of coordinates.
```r
airwars_incidents |>
  select(target_type, contains("lat"), contains("long")) |>
  head() |>
  DT::datatable()
```
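A single reverse-geocode call can be sketched as below, using the httr2 package against Nominatim's /reverse endpoint. This is an illustration, not our production loop; in practice requests should be rate-limited (about one per second) per Nominatim's usage policy, and the example coordinates are arbitrary.

```r
# Sketch of one Nominatim reverse-geocode request.
library(httr2)

reverse_geocode <- function(lat, lon) {
  request("https://nominatim.openstreetmap.org/reverse") |>
    req_url_query(lat = lat, lon = lon, format = "jsonv2") |>
    req_user_agent("airwars-scraping-project") |>  # identify per usage policy
    req_perform() |>
    resp_body_json()
}

res <- reverse_geocode(31.5, 34.45)
# res$type holds the location type; res$boundingbox the bounding-box coords
```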
Sentiment Analysis
After attempting several text classification models and some question/context models, we landed on j-hartmann/emotion-english-distilroberta-base because it goes beyond a positive/negative evaluation and analyzes text for Ekman's six basic emotions, a framework common in psychological work on emotion. Moreover, this model affords us the ability to examine the emotional tone of these assessments over time.2
We get a score for each emotion: the closer to 1, the stronger the association, and the seven scores (six emotions plus neutral) sum to 1.
Given that we have over 800 assessments, we decided to use text3 because it allows us to use a laptop GPU (RTX 4070)4 to run these models for each incident, which resulted in large processing gains.
Below we print an example of these scores, truncating the assessment text.
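Scoring the assessments can be sketched as below. This assumes the text package's `textClassify()` wrapper around Hugging Face classification pipelines; argument names may differ across text versions, so treat this as a sketch rather than our exact call.

```r
# Sketch: score each assessment on the 7 emotion classes with the text
# package (textClassify() and its arguments assumed; check your text version).
library(text)

emotions <- textClassify(
  airwars_incidents$assessment,
  model = "j-hartmann/emotion-english-distilroberta-base",
  return_all_scores = TRUE   # one score per emotion class, summing to 1
)
```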
```r
airwars_incidents |>
  slice_sample(n = 1) |>
  select(assessment:surprise) |>
  mutate(assessment = str_trunc(assessment, 200),
         across(where(is.double), ~ round(.x, 2))) |>
  DT::datatable()
```
Footnotes
Note. Confidence is low to moderate since the data come from the Hamas MoH.↩︎
The model is trained on a balanced subset from the datasets listed on the model card (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).↩︎
An R package for analyzing natural language with transformers from Hugging Face using natural language processing and machine learning.↩︎
The installation for text is tricky, as the right Python libraries must be installed. To compile models with the GPU, we learned that the NVIDIA CUDA drivers for version 12.1 must be installed. Additionally, we could only get this to work via Anaconda within Ubuntu 24.04 installed through WSL2 on Windows 11; Ubuntu 24.10 comes with a kernel that forces CUDA 12.8 to be installed, which did not work for us in a dual-boot system.↩︎